Noblis Team - Protovis and Processing

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:


Catherine Campbell, PhD, Noblis, Team Lead, catherine.campbell@noblis.org [PRIMARY contact]

Seth Blanchard, Noblis, seth.blanchard@noblis.org
Mitchell Holland, Noblis, mitchell.holland@noblis.org
Jill McCracken, Noblis, jill.mccracken@noblis.org
Harry Cummins, Noblis, graphic artist, hcummins@noblis.org  
Richard P. DiMassimo, video producer, rdimassimo@noblis.org
Austin Blanton, Noblis Intern, austin.blanton@noblis.org

Noblis VAST Webpage: http://www.noblis.org/VAST

Tool(s):

Primary tools used for this project included:

1) SNUFER (http://www.bioinformation.net/003/001300032008.htm) was used to generate SNP tables. It was developed in 2008 by Mozart Marins' group at the Unidade de Biotecnologia in Sao Paulo, Brazil.
2) Clustal W (http://www.clustal.org/) was used to align sequences and generate phylogenetic trees. This program was developed at Trinity College, Dublin, in 1988.
3) Perl (http://www.perl.org/) was used to develop scripts to organize tables.
4) R (http://www.r-project.org/), a statistical package, was used to calculate the significance of SNPs.
5) Processing (http://processing.org/) was used to develop interactive SNP plots. This is an open source design language started by Ben Fry and Casey Reas in 2001.
6) Protovis (http://vis.stanford.edu/protovis/) was used to develop interactive pedigree trees in a space-saving format. This is a JavaScript-based visualization toolkit developed at Stanford and released as an open source tool in 2009.

With the exception of SNUFER and Clustal W, which can be used by general biologists and bioinformaticians, these toolkits require some programming ability. Scripting in Perl and Processing has a low initial learning curve, but complex visualizations take time to learn. However, once scripts are developed, any end user (biologist, analyst, or bioinformatician) can interact with the visualizations with little or no training. Protovis has a longer learning curve than Processing, but the interactive visualizations can be used by any end user. R requires the most training, in both statistics and scripting. It can be adapted for non-programmers through the development of user-friendly interfaces.


Video:
 
Noblis_Processing_MC3.mp4


ANSWERS:

MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.

We identified Nigeria as the origin of the current outbreak. Twelve SNPs differentiate Nigeria_B from the closest outbreak sequence (531). We used SNUFER to identify SNPs, Perl to organize tables, and a pedigree (Figure 3.1.1), drawn manually in Microsoft Visio, to visualize the lineage. The pedigree shows the Drafa Fever virus lineage in Africa and links Nigeria_B to sequence 531. The pedigree is vertically scaled to, and highlights, the number of SNPs that change between sequences (numbers in arrows). Additionally, the temporal and geographic progression of the disease across Africa was plotted (Figure 3.1.2) based on the data in the pedigree. Numbers on the map correspond to those on the pedigree, and circles are scaled to the geographic area named in each sequence. This entire analysis and visualization required approximately 8 hours.
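The SNP comparison behind this answer can be illustrated with a minimal Python sketch. It is not the SNUFER and Perl pipeline described above; it only shows the idea of counting positions where an aligned outbreak sequence differs from each candidate native strain and picking the strain with the fewest differences. The function names and the toy sequences are assumptions made for illustration and are not challenge data.

```python
def snp_positions(reference, sample):
    """Return 1-based positions where two aligned sequences differ.

    Alignment gaps ("-") are skipped so that only base substitutions count.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        i + 1
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and "-" not in (ref_base, sample_base)
    ]


def closest_reference(references, sample):
    """Return the name of the reference strain with the fewest SNPs versus sample."""
    return min(references, key=lambda name: len(snp_positions(references[name], sample)))


# Toy aligned sequences, invented for illustration only.
native_strains = {
    "strain_A": "ACGTACGTAC",
    "strain_B": "ACGTTCGTAC",
}
outbreak_sample = "ACGTTCGTAG"

print(snp_positions(native_strains["strain_A"], outbreak_sample))  # [5, 10]
print(closest_reference(native_strains, outbreak_sample))          # strain_B
```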

native_Pedigree
Figure 3.1.1. Pedigree Tree of Native Strains

Africa
Figure 3.1.2. Temporal and Geographic Progression of Drafa Fever Virus

MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

The patient with sequence 123 contracted Drafa virus from Nicolai Kuryakin (sequence 583). Sequence 123 differs from sequence 583 by a single SNP, indicating direct descent. A two-person team spent four days creating two analytic visualizations: a sunburst plot using Protovis (Figure 3.2.1) and a polar plot using Processing (Figure 3.2.2). The sunburst plot is a novel, space-saving way to depict pedigrees; it pulls data from a table of SNPs and automates the visualization process. Descendants radiate out from a central ancestor, and sequence 123 is in the direct lineage of sequence 583. In the polar plot (Figure 3.2.2) we can drill down to the SNPs that drive the relationships between sequences, highlighting the SNPs shared by sequences 123 and 583 (red and orange) in contrast to the blue SNP in sequence 51.
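The descent argument can be expressed as a small set comparison. The Python sketch below assumes that each sequence is summarized as a set of SNP positions relative to a common reference and that SNPs are not lost once acquired; under those assumptions a descendant carries every SNP of its source plus a few new ones. The function name and the SNP positions are hypothetical and do not reproduce the challenge data.

```python
def extra_snps_if_descendant(source_snps, candidate_snps):
    """Return the SNPs a candidate gained on top of the source's SNPs.

    Returns None when the candidate lacks any SNP the source carries,
    in which case it cannot descend from the source under the
    no-back-mutation assumption.
    """
    if not source_snps <= candidate_snps:
        return None
    return candidate_snps - source_snps


# Hypothetical SNP positions relative to a reference strain.
source = {101, 250, 377}
patient_x = {101, 250, 377, 612}  # carries all source SNPs and gains one
patient_y = {101, 250, 805}       # lacks SNP 377, so not a descendant

print(extra_snps_if_descendant(source, patient_x))  # {612}
print(extra_snps_if_descendant(source, patient_y))  # None
```

A candidate with a single extra SNP, like patient_x here, is the pattern described for sequences 583 and 123.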

sunburst_patients
Figure 3.2.1. Sunburst Plot Showing Pedigree of Patients

Polar_patients
Figure 3.2.2. Polar Plot Highlighting Patient SNPs

MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.  
Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C --> G, 456 (C changed to G at position 456)
G --> A, 513 and T --> A, 907 (G changed to A at position 513 and T changed to A at position 907)
A --> G, 39 (A changed to G at position 39)

1. T --> C, 842 and A --> T, 946: p-value 2.9 x 10^-5 as a pair
2. A --> C, 269: p-value 0.0019
3. A --> G, 223: p-value 0.009

One bioinformatician spent two days examining the association between SNPs and symptom severity using the Mann-Whitney U test (in R). The three sets of significant SNPs numbered above are illustrated by the patient groups circled in orange on the pedigree (Figure 3.3.1, created with Protovis). Individual SNPs driving these clusters were visualized using our polar plot (Figure 3.3.2, created with Processing). We chose to combine SNPs 842 and 946 because the most severe cases had both SNPs. Because no samples had SNP 946 alone, it was impossible to determine whether 946 is significant on its own or only in concert with 842.
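The decision to treat two SNPs as a pair rests on a simple co-occurrence count. A minimal Python sketch of that check follows; the function name and the patient table are invented for illustration, and only the SNP positions 842 and 946 come from the answer above.

```python
def cooccurrence(samples, snp_a, snp_b):
    """Count samples carrying both SNPs, only snp_a, or only snp_b."""
    counts = {"both": 0, "only_a": 0, "only_b": 0}
    for snps in samples.values():
        has_a, has_b = snp_a in snps, snp_b in snps
        if has_a and has_b:
            counts["both"] += 1
        elif has_a:
            counts["only_a"] += 1
        elif has_b:
            counts["only_b"] += 1
    return counts


# Hypothetical patient -> SNP set table (not the challenge data).
samples = {
    "p1": {842, 946},
    "p2": {223, 842, 946},
    "p3": {842},
    "p4": {269},
}

result = cooccurrence(samples, 842, 946)
print(result)  # {'both': 2, 'only_a': 1, 'only_b': 0}
```

When "only_b" is zero, as in this toy table, the effect of the second SNP cannot be separated from the first, which is the situation described for SNP 946.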

Sunburst_symptoms
Figure 3.3.1. Sunburst Plot Highlighting Pedigree and Symptoms

polar_symptoms
Figure 3.3.2. Polar Plot Identifying SNPs in Sequences

MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.
Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C --> G, 456 (C changed to G at position 456)
G --> A, 513 and T --> A, 907 (G changed to A at position 513 and T changed to A at position 907)
A --> G, 39 (A changed to G at position 39).

The 3 mutations that lead to the most dangerous viral strains were:

1. T --> C, 842 and A --> T, 946: p-value 0.0004
2. A --> G, 223: p-value 0.0015
3. A --> C, 269: p-value 0.0076

Throughout the genomics challenge, visualizations and statistics have driven our analysis. The outbreak sequences were aligned using Clustal W to identify the origin of the outbreak. SNUFER was used to cluster the sequences and automate the generation of SNP tables based on divergence from Nigeria_B. By transforming and displaying the data in novel ways, we quickly found that only 57 out of the 1404 total nucleotides in this sequence had a SNP present in at least one sample. Perl was used to sort the data and select the 21 SNPs that occurred in at least two—but not in all—outbreak samples.
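The frequency filter described above (keep SNPs seen in at least two samples but not in every sample) can be sketched in Python as follows. This stands in for the Perl step and is not the original script; the function name and the toy table are assumptions.

```python
from collections import Counter


def informative_snps(sample_snps):
    """Keep SNP positions seen in at least two samples but not in all of them."""
    counts = Counter(pos for snps in sample_snps.values() for pos in snps)
    total = len(sample_snps)
    return sorted(pos for pos, n in counts.items() if 2 <= n < total)


# Hypothetical sample -> SNP position table (not the challenge data).
toy_table = {
    "s1": {10, 20, 30},
    "s2": {10, 20},
    "s3": {10, 40},
}

# Position 10 is in every sample and positions 30 and 40 are singletons,
# so only position 20 survives the filter.
print(informative_snps(toy_table))  # [20]
```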

We used this filtered dataset for further analysis in Excel. In Figure 3.4.1 we show a view of our Excel table in which high frequency SNPs (those occurring in more than 4 samples) are colored black in vertical bars and each patient’s overall severity score (ranging from 1-8) is colored horizontally from green to red (green for 1, yellow for 4, and red for 8). To arrive at our overall severity score, we scored each categorical disease characteristic from 0-2 (Complications—a binary variable—was scored as 0 or 2). We then summed the scores for all characteristics for each patient.
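The scoring rule can be written down as a short Python sketch. The source gives only the 0-2 scale per characteristic and the 0-or-2 rule for complications; the characteristic names and the level labels ("low", "medium", "high") below are assumptions chosen for illustration.

```python
# Assumed level labels; the source states only that each characteristic scores 0-2.
LEVEL_SCORE = {"low": 0, "medium": 1, "high": 2}

# Assumed names for the graded (non-binary) disease characteristics.
GRADED = ("symptoms", "mortality", "drug_resistance", "at_risk_vulnerability")


def overall_severity(patient):
    """Sum per-characteristic scores; binary complications count as 0 or 2."""
    score = sum(LEVEL_SCORE[patient[name]] for name in GRADED)
    score += 2 if patient["complications"] else 0
    return score


# Hypothetical patient record.
patient = {
    "symptoms": "high",
    "mortality": "medium",
    "drug_resistance": "low",
    "at_risk_vulnerability": "high",
    "complications": True,
}

print(overall_severity(patient))  # 7
```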

Colorizing our tables immediately illustrated correlations between particular SNPs and severity. From these color patterns it was easy to see that some SNPs, like SNP 161, occur frequently, but in both severe and non-severe cases. Other SNPs, such as SNP 223, only occur in severe cases. Finally, some SNPs, such as SNP 22, seem to moderate disease severity. We have made one major assumption in this analysis: once a SNP is present, it would rarely if ever back-mutate. We have therefore focused only on SNPs that are divergent from Nigeria_B. All of these initial tasks, including the statistics below, required one bioinformatician approximately 40 hours to complete.

Excel
Figure 3.4.1. Excel Table of SNPs by Patient Showing Symptom Severity and High Frequency SNPs

We next statistically verified which mutations were responsible for the most dangerous viral strains. Based on our overall severity score, we examined the significance of individual SNPs using the Mann-Whitney U test in R. For each SNP test we created two patient vectors using the overall severity scores: one of patients with the SNP, and one of patients without. We found five mutations to be statistically significant (Figure 3.4.2). Two of these mutations, 842 and 946, were previously found to be highly correlated, and we concluded that both together are responsible for overall disease severity. SNP 790 is slightly correlated with SNP 223 and has a less significant p-value, so we eliminated it from further consideration here.
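For readers without R, the test can be approximated with the Python standard library. The sketch below is not the R analysis used here: it computes the U statistic from midranks and a two-sided p-value from the normal approximation with a tie correction and no continuity correction, so its p-values may differ somewhat from what R reports, especially for small groups. The function name and the severity scores are illustrative assumptions.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, p-value). Ties receive midranks and the
    variance is tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this tie group
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical overall severity scores (not the challenge data).
with_snp = [7, 8, 6, 8]
without_snp = [2, 3, 1, 4, 3]

u_stat, p_value = mann_whitney_u(with_snp, without_snp)
print(u_stat, round(p_value, 4))
```

Running this once per SNP, with the two vectors split by presence of that SNP, mirrors the procedure described above.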

pvalues
Figure 3.4.2. P-values of SNPs Associated With Disease Severity

Statistics generally lack a visual punchline. We therefore wanted to develop new approaches to visualizing genetic data to illustrate relationships among strains and highlight SNPs that drive these relationships. The plots we developed are interactive and allow users to focus on particular groupings of SNPs. When we first developed polar plots for MC3.2, we realized that traditional phylogenetic trees did not clearly show patient groupings. As an alternative, we decided to use pedigree plots to illustrate direct ancestry. Most pedigree software packages track inheritance of genetic disorders with parent-child data and do not adapt well to single “parent” viral data. Therefore we initially created pedigree trees manually using Microsoft Visio (Figure 3.4.3). Although these plots provide an alternative way to look at genetic data, we wanted to automate these visualizations because the plots do not scale well horizontally and are difficult to render manually for more than a few dozen patients.

pedigree
Figure 3.4.3. Pedigree Chart of Patients Colored By Overall Disease Severity

We used Protovis to automate drawing pedigree plots. This toolkit allowed us to create the interactive visualization shown in Figure 3.4.4. The pedigree is now drawn as a circular sunburst chart to increase scalability. This script runs as a web application and has a drop-down box at the bottom that lets the user select among the five disease characteristics, each displayed with a different color scheme, or color the plot by our overall severity scale. This plot illustrates the hierarchical relationships among patients and displays clusters of patients with similar severity characteristics. The plot can be rendered from tabular data and could easily be further developed into a web tool that would allow users to import their own data for visualization. This recolorized plot took one web application developer approximately 1 hour to edit.
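Rendering a sunburst from tabular data implies a step that turns a flat parent table into a nested hierarchy. The Python sketch below illustrates that step only; it is not the Protovis script used for Figure 3.4.4, and the function name, the table layout (child mapped to its single parent), and the sequence labels are assumptions.

```python
def build_tree(parent_of):
    """Turn a child -> parent table into nested dicts rooted at the ancestor.

    The root is the one entry whose parent is None.
    """
    children = {}
    roots = []
    for child, parent in parent_of.items():
        if parent is None:
            roots.append(child)
        else:
            children.setdefault(parent, []).append(child)
    if len(roots) != 1:
        raise ValueError("expected exactly one root sequence")

    def subtree(node):
        return {kid: subtree(kid) for kid in sorted(children.get(node, []))}

    return {roots[0]: subtree(roots[0])}


# Hypothetical pedigree table: each sequence has a single "parent".
parent_of = {"root": None, "a": "root", "b": "root", "c": "a"}

print(build_tree(parent_of))  # {'root': {'a': {'c': {}}, 'b': {}}}
```

A nested structure of this kind is what hierarchical layouts generally consume, with each level of nesting becoming one ring of the sunburst.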

sunburst_overall
Figure 3.4.4. Sunburst Plot Showing Pedigree of Patients Colored by Overall Severity

We developed a second visualization to highlight individual SNPs and their relationships to disease characteristics. We used Processing to make an interactive polar plot for viewing and analyzing all SNPs simultaneously. Figure 3.4.5 shows this plot displaying the outbreak patients on the radii, with all 57 SNPs (in order of their location in the sequence) represented in concentric inner circles. This plot can be colorized by any disease characteristic, but is colored here to illustrate overall severity. The radii are colored by severity score to match the sunburst plot, and the SNPs significantly associated with overall severity are also colored. This visualization is interactive, with mouse-over information in the left panel showing the current patient ID, SNP changes (and p-values where applicable), and the disease characteristics (with severe characteristics highlighted in red). Groups of SNPs (like the SNPs shown in red) are easily identified, and different SNPs contribute to separate branches of severe disease. This plot took one bioinformatician and one web application developer approximately 16 hours to develop.
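The geometry of such a polar plot is straightforward: each patient gets an angle and each SNP gets a ring radius. The Python sketch below computes those coordinates only; it is not the Processing sketch behind Figure 3.4.5, and the function name, the inner radius, and the ring spacing are arbitrary assumptions.

```python
import math


def polar_layout(n_patients, n_snps, inner_radius=50.0, ring_gap=5.0):
    """Return {(patient_index, snp_index): (x, y)} for a polar SNP plot.

    Patients are spread evenly around the circle (one radius each) and
    SNPs sit on concentric rings in sequence order, innermost first.
    """
    points = {}
    for p in range(n_patients):
        angle = 2 * math.pi * p / n_patients
        for s in range(n_snps):
            radius = inner_radius + s * ring_gap
            points[(p, s)] = (radius * math.cos(angle), radius * math.sin(angle))
    return points


layout = polar_layout(n_patients=4, n_snps=3)
print(layout[(0, 0)])  # (50.0, 0.0)
```

Each cell can then be colored by whether the patient carries that SNP, and each radius by the patient's severity score.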

polar_overall
Figure 3.4.5. Polar Plot Showing Significant SNPs Colored by Overall Disease Severity